Change “your name” in the YAML header above to your name.
As usual, enter the examples in code chunks and run them, unless told otherwise.
Read R4ds Chapter 10: Tibbles, sections 1-3.
Load the tidyverse package.
library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.0     v purrr   0.2.5
v tibble  1.4.2     v dplyr   0.7.8
v tidyr   0.8.2     v stringr 1.3.1
v readr   1.3.1     v forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Enter your code chunks for Section 10.2 here.
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
Describe what each chunk code does.
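1: creates a tibble in which z is computed from the other columns (x squared plus y). 2: creates a tibble whose column names are non-syntactic, so they must be wrapped in backticks. 3: uses tribble() for row-by-row data entry, laid out in columns so it is easier to read.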
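As a small check of the descriptions above, the sketch below uses a shortened version of the first two chunks. The object names demo and odd are made up for this illustration.

```r
library(tibble)

# y has length 1, so it is recycled to match x; z can refer to x and y
# because tibble() evaluates its arguments in order.
demo <- tibble(x = 1:3, y = 1, z = x^2 + y)
demo$z

# Non-syntactic column names must be wrapped in backticks when referenced.
odd <- tibble(`:)` = "smile", `2000` = "number")
odd$`2000`
```

The backticks are only needed when typing the name; the column itself is stored under the plain string "2000".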
Enter your code chunks for Section 10.3 here.
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
nycflights13::flights %>%
print(n = 10, width = Inf)
nycflights13::flights %>%
View()
df <- tibble(
x = runif(5),
y = rnorm(5)
)
df$x
[1] 0.4134812 0.6645216 0.2313980 0.3303780
[5] 0.1114870
df[["x"]]
[1] 0.4134812 0.6645216 0.2313980 0.3303780
[5] 0.1114870
df[[1]]
[1] 0.4134812 0.6645216 0.2313980 0.3303780
[5] 0.1114870
df %>% .$x
[1] 0.4134812 0.6645216 0.2313980 0.3303780
[5] 0.1114870
df %>% .[["x"]]
[1] 0.4134812 0.6645216 0.2313980 0.3303780
[5] 0.1114870
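Describe what each chunk code does. 1: printing a tibble shows only the first 10 rows and the columns that fit on screen, which makes large data frames easier to work with. 2: print() controls how many rows (n) and how much width are shown; width = Inf shows all columns. 3: View() opens a scrollable view of the whole data frame. 4: $ and [[ pull out a single variable, by name or by position. 5: inside a pipe, the placeholder . stands for the data frame, so the same extractions work there.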
Answer the questions completely. Use code chunks, text, or both, as necessary.
1: How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame). Identify at least two ways to tell if an object is a tibble. Hint: What does as_tibble() do? What does class() do? What does str() do?
mtcars
as_tibble(mtcars)
is_tibble(mtcars)
[1] FALSE
class(mtcars)
[1] "data.frame"
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
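is_tibble() returns TRUE or FALSE. as_tibble() converts a data frame to a tibble, which then prints only the first ten observations. class() reports the class of an object; for a tibble it includes "tbl_df". str() gives a compact summary of the table, and its first line names the type.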
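The checks can also be run side by side on mtcars and a converted copy. This is only a sketch; cars_tbl is a name chosen for the example.

```r
library(tibble)

cars_tbl <- as_tibble(mtcars)

is_tibble(mtcars)             # FALSE: a regular data frame
is_tibble(cars_tbl)           # TRUE
class(cars_tbl)               # "tbl_df" "tbl" "data.frame"
inherits(cars_tbl, "tbl_df")  # another way to ask the same question
```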
2: Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df$x
df[, "xyz"]
df[, c("abc", "xyz")]
tbl <- as_tibble(df)
tbl$xyz
[1] a
Levels: a
tbl[, "xyz"]
tbl[, c("abc", "xyz")]
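There are two differences. With a data frame, $ does partial matching, so df$x silently returns the xyz column; a tibble never partially matches. With a data frame, [ returns a vector when one column is selected and a data frame when several are selected; with a tibble, [ always returns a tibble. The data frame defaults are frustrating because a typo can return the wrong column and the type of the result depends on how many columns were picked.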
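The comparison can be verified with the same two objects. The sketch below only adds is.null() and is.data.frame() checks to the code from the question.

```r
library(tibble)

df <- data.frame(abc = 1, xyz = "a")
tbl <- as_tibble(df)

# data.frame: $ partially matches "x" to the xyz column
is.null(df$x)                     # FALSE, something was returned
# tibble: no partial matching, so the result is NULL (with a warning)
is.null(suppressWarnings(tbl$x))  # TRUE

# data.frame: selecting one column with [ drops to a vector
is.data.frame(df[, "xyz"])        # FALSE
# tibble: [ always returns a tibble
is.data.frame(tbl[, "xyz"])       # TRUE
```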
Read R4ds Chapter 11: Data Import, sections 1, 2, and 5.
Nothing to do here unless you took a break and need to reload tidyverse.
Do not run the first code chunk of this section, which begins with heights <- read_csv("data/heights.csv"). You do not have that data file so the code will not run.
Enter and run the remaining chunks in this section.
read_csv("a,b,c
1,2,3
4,5,6")
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
read_csv("1,2,3\n4,5,6", col_names = FALSE)
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
read_csv("a,b,c\n1,2,.", na = ".")
1: What function would you use to read a file where fields were separated with “|”?
read_delim(file, delim = "|")  # file is a placeholder for the path to the data
2: (This question is modified from the text.) Finish the two lines of read_delim code so that the first one would read a comma-separated file and the second would read a tab-separated file. You only need to worry about the delimiter. Do not worry about other arguments. Replace the dots in each line with the rest of your code.
file <- read_delim("file_csv", delim = ",")
file <- read_delim("file_tsv", delim = "\t")
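Since file_csv and file_tsv do not exist on disk, the delimiter argument can be tried on inline data instead. This is a sketch with made-up strings; readr treats a string that contains a newline as literal data.

```r
library(readr)

# Same three values, three different delimiters
csv_data  <- read_delim("a,b,c\n1,2,3\n", delim = ",")
tsv_data  <- read_delim("a\tb\tc\n1\t2\t3\n", delim = "\t")
pipe_data <- read_delim("a|b|c\n1|2|3\n", delim = "|")

names(pipe_data)
```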
3: What are the two most important arguments to read_fwf()? Why? file and col_positions. file says what to read, and col_positions, built with fwf_widths() or fwf_positions(), specifies where each field of the fixed-width file begins and ends.
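A short sketch of the two ways to describe the columns, using an invented two-line fixed-width string (a five-character name field followed by a two-character age field):

```r
library(readr)

fixed <- "John 25\nMary 31\n"

# Describe the columns by their widths ...
by_width <- read_fwf(fixed, col_positions = fwf_widths(c(5, 2), c("name", "age")))

# ... or by their start and end positions
by_pos <- read_fwf(fixed, col_positions = fwf_positions(c(1, 6), c(4, 7), c("name", "age")))

by_width
```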
4: Skip this question
5: Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
2 parsing failures.
row col expected actual file
1 -- 2 columns 3 columns literal data
2 -- 2 columns 3 columns literal data
read_csv("a,b,c\n1,2\n1,2,3,4")
2 parsing failures.
row col expected actual file
1 -- 3 columns 2 columns literal data
2 -- 3 columns 4 columns literal data
read_csv("a,b\n\"1")
2 parsing failures.
row col expected actual file
1 a closing quote at end of file literal data
1 -- 2 columns 1 columns literal data
read_csv("a,b\n1,2\na,b")
read_csv("a;b\n1;3")
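1: the header has two columns but each row has three values, so the extra values are dropped. 2: the header has three columns, but the first row has two values (c becomes NA) and the second has four (the extra value is dropped). 3: the opening quote is never closed and the row has only one value, so column b is NA. 4: the second data row contains the text a and b, so both columns are read as character rather than numeric. 5: a semicolon is used instead of a comma, so everything is read into a single column named a;b.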
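For the semicolon case, readr provides read_csv2(), which expects ; as the delimiter. The sketch below compares the two readers on the same inline string; wrong and right are names chosen for the example.

```r
library(readr)

wrong <- read_csv("a;b\n1;3\n")   # comma reader: one column named "a;b"
right <- read_csv2("a;b\n1;3\n")  # semicolon reader: columns a and b

ncol(wrong)
names(right)
```

read_delim() with delim = ";" would be another option.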
Just read this section. You may find it helpful in the future to save a data file to your hard drive. It is basically the same format as reading a file, except that you must specify the data object to save, in addition to the path and file name.
Read R4ds Chapter 18: Pipes, sections 1-3.
Nothing to do otherwise for this chapter. Is this easy or what?
Note: Try using pipes for all of the remaining examples. That will help you understand them.
Read R4ds Chapter 12: Tidy Data, sections 1-3, 7.
Nothing to do here unless you took a break and need to reload the tidyverse.
Study Figure 12.1 and relate the diagram to the three rules listed just above it. Relate that back to the example I gave you in the notes. Bear this in mind as you make data tidy in the second part of this assignment.
You do not have to run any of the examples in this section.
Read and run the examples through section 12.3.1 (gathering), including the example with left_join(). We’ll cover joins later.
table4a
tidy4a <- table4a %>%
gather(`1999`,`2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b)
Joining, by = c("country", "year")
2: Why does this code fail? Fix it so it works.
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
#> Error in inds_combine(.vars, ind_list): Position must be between 0 and n
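The original code did not wrap the years in backticks, so gather() read 1999 and 2000 as column positions instead of column names. Adding the backticks, as above, fixes it.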
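A quick check of the fixed call on table4a, which ships with tidyr. This sketch only inspects the shape of the result.

```r
library(tidyr)

# Backticks mark 1999 and 2000 as column names rather than positions;
# table4a has only three columns, so position 1999 does not exist.
tidy4a <- gather(table4a, `1999`, `2000`, key = "year", value = "cases")

dim(tidy4a)    # 6 rows, 3 columns
names(tidy4a)  # "country" "year" "cases"
```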
That is all for Chapter 12. On to the last chapter.
Read R4ds Chapter 5: Data Transformation, sections 1-4.
Time to get small.
Load the necessary libraries. As usual, type the examples into and run the code chunks.
nycflights13::flights
filter()
Study Figure 5.1 carefully. Once you learn the &, |, and ! logic, you will find them to be very powerful tools.
1.1: Find all flights with a delay of 2 hours or more.
nycflights13::flights
flights <- nycflights13::flights
filter(flights , arr_delay >= 120)
1.2: Flew to Houston (IAH or HOU)
filter(flights, dest == "IAH" | dest == "HOU")
1.3: Were operated by United (UA), American (AA), or Delta (DL).
nycflights13::airlines
filter(flights, carrier %in% c("AA", "DL", "UA"))
1.4: Departed in summer (July, August, and September).
filter(flights, month >= 7, month <= 9)
1.5: Arrived more than two hours late, but didn’t leave late.
filter(flights, dep_delay <= 0, arr_delay > 120)
1.6: Were delayed by at least an hour, but made up over 30 minutes in flight. This is a tricky one. Do your best.
filter(flights, dep_delay >= 60, dep_delay - arr_delay > 30)
1.7: Departed between midnight and 6am (inclusive)
filter(flights, dep_time <= 600 | dep_time == 2400)
2: Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?
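between(x, left, right) tests whether x falls within a range, including both endpoints. Here it replaces the two comparisons used to find months 7 through 9, the summer months.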
filter(flights, between(month, 7, 9))
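To see that between() includes both endpoints, it can be tried on a plain vector. The vector months below is made up for the illustration.

```r
library(dplyr)

months <- c(6, 7, 8, 9, 10)

between(months, 7, 9)      # FALSE TRUE TRUE TRUE FALSE
months >= 7 & months <= 9  # the longer form gives the same answer
```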
3: How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
filter(flights, is.na(dep_time))
Departure delay, arrival time, arrival delay, and air time are also missing in these rows. They most likely represent cancelled flights.
4: Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)
NA ^ 0
[1] 1
NA | TRUE
[1] TRUE
NA & FALSE
[1] FALSE
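NA ^ 0 is 1 because any number raised to the power 0 is 1. NA | TRUE is TRUE because anything or TRUE is TRUE, and FALSE & NA is FALSE because anything and FALSE is FALSE. The general rule: the result is not missing whenever it would be the same no matter what value the NA stands for. NA * 0 stays NA because the unknown value could be infinite, and Inf * 0 is NaN.
Note: For some context, see this thread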
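The counterexample mentioned in the question can be run next to the other cases. Everything in this sketch is base R.

```r
NA ^ 0      # 1
NA | TRUE   # TRUE
FALSE & NA  # FALSE

# The counterexample: the answer depends on the unknown value,
# because a finite number times 0 is 0 but Inf * 0 is NaN.
NA * 0      # NA
Inf * 0     # NaN
```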
arrange()
1: How could you use arrange() to sort all missing values to the start? (Hint: use is.na()). Note: This one should still have the earliest departure dates after the NAs. Hint: What does desc() do? It sorts in descending order, so rows where is.na() is TRUE come first.
arrange(flights, desc(is.na(dep_time)), dep_time)
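The same pattern is easier to follow on a tiny table. The tibble toy below is invented for the illustration and is not part of flights.

```r
library(dplyr)
library(tibble)

toy <- tibble(dep_time = c(900, NA, 530, NA, 1215))

# is.na() is TRUE for missing rows; desc() puts TRUE ahead of FALSE,
# and the second key sorts the remaining rows in ascending order.
sorted <- arrange(toy, desc(is.na(dep_time)), dep_time)
sorted$dep_time
```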
2: Sort flights to find the most delayed flights. Find the flights that left earliest.
This question is asking for the flights that were most delayed (left latest after scheduled departure time) and least delayed (left ahead of scheduled time).
Most delayed: the flight scheduled to depart at 9:00 on Jan 9, 2013. Left earliest relative to schedule: the flight scheduled to depart at 21:23 on Dec 7, 2013.
arrange(flights, desc(dep_delay))
arrange(flights, dep_delay)
3: Sort flights to find the fastest flights. Interpret fastest to mean shortest time in the air.
arrange(flights, air_time)
Optional challenge: fastest flight could refer to fastest air speed. Speed is measured in miles per hour but time is minutes. Arrange the data by fastest air speed.
arrange(flights, desc(distance / air_time * 60))
4: Which flights travelled the longest? Which travelled the shortest?
Longest: 4983 miles. Shortest: 17 miles.
arrange(flights, desc(distance))
arrange(flights, distance)
select()
1: Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights. Find at least three ways.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, 4, 6, 7, 9)
select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
2: What happens if you include the name of a variable multiple times in a select() call?
Repeated names are ignored: the variable is included only once, at the position where it first appears.
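A small sketch of that behaviour, using an invented three-column tibble named toy rather than flights:

```r
library(dplyr)
library(tibble)

toy <- tibble(dep_time = 517, arr_time = 830, carrier = "UA")

# dep_time is named twice but appears only once, in its first position
repeated <- select(toy, dep_time, carrier, dep_time)
names(repeated)
```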
3: What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
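one_of() selects the columns whose names appear in a character vector. It is helpful here because the names are already stored in vars, so select(flights, one_of(vars)) picks them without retyping each one.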
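A minimal sketch of one_of() with a stored vector of names. The tibble toy and the shortened vector wanted are made up for this example so that it runs without the flights data.

```r
library(dplyr)
library(tibble)

toy <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2, carrier = "UA")
wanted <- c("year", "month", "day", "dep_delay")

# one_of() takes the names from the character vector
picked <- select(toy, one_of(wanted))
names(picked)
```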
4: Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
select(flights, contains("TIME"))
select(flights, contains("Time", ignore.case = FALSE))
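The effect of ignore.case can be seen on a small table. The sketch below uses an invented tibble named toy; by default the select helpers ignore case, so "TIME" matches the lowercase column names.

```r
library(dplyr)
library(tibble)

toy <- tibble(dep_time = 517, arr_time = 830, carrier = "UA")

# Default: case is ignored, so both *_time columns match
loose <- select(toy, contains("TIME"))
names(loose)

# ignore.case = FALSE: no column contains the uppercase text "TIME"
strict <- select(toy, contains("TIME", ignore.case = FALSE))
ncol(strict)
```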